Harden NexusTokens for polymorphism with internal whitespace; add NexusTokensToInteger() #269
Merged
…tespace
No parser fix was needed: the existing gsub(" ", "", ...) at parse_files.R:88
already strips internal whitespace from polymorphism tokens before NexusTokens()
sees them, so (1 2) -> (12) and {0 1} -> {01} already worked correctly.
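The whitespace stripping described above can be reproduced in base R. This is a minimal sketch: the `raw` strings are hypothetical examples rather than lines from the package's test data, and only the `gsub(" ", "", ...)` call mirrors the one cited in the commit message.

```r
# Hypothetical raw matrix rows containing polymorphism tokens with spaces.
raw <- c("01(1 2)0", "1{0 1}?-")

# Remove every space, mirroring the gsub(" ", "", ...) call cited above.
stripped <- gsub(" ", "", raw)

print(stripped)
# [1] "01(12)0" "1{01}?-"
```

Because the spaces are removed before tokenisation, a tokenizer downstream only ever sees compact tokens such as `(12)` and `{01}`.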
Added:
- Regression test covering (1 2) / {0 1} polymorphism and multi-line matrix
continuation in test-parsers.R.
- NexusTokensToInteger(): a new exported helper that converts the character
matrix from ReadCharacters() to integer, mapping polymorphic/ambiguous/?/-
tokens to NA_integer_ by default, or extracting the first/last state digit
under polymorphism = "first"/"last". Tests included.
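As a rough illustration of the mapping described above, the sketch below converts a character vector of tokens to integers. `tokens_to_integer_sketch()` is a hypothetical stand-in written for this example; it is not the package's `NexusTokensToInteger()`, it handles plain vectors only, and its regular expressions are assumptions.

```r
# Hypothetical helper for illustration; NOT the package implementation.
tokens_to_integer_sketch <- function(tokens,
                                     polymorphism = c("?", "first", "last")) {
  polymorphism <- match.arg(polymorphism)

  # Start with everything missing; "?", "-" and unknown symbols stay NA.
  out <- rep(NA_integer_, length(tokens))

  # Single-digit tokens convert directly.
  single <- grepl("^[0-9]$", tokens)
  out[single] <- as.integer(tokens[single])

  if (polymorphism != "?") {
    # Bracketed tokens such as "(12)" or "{01}".
    poly <- grepl("^[({].*[)}]$", tokens)
    digits <- gsub("[^0-9]", "", tokens[poly])
    n <- nchar(digits)
    pos <- if (polymorphism == "first") rep(1L, length(n)) else n
    # Tokens with no digits yield "", which as.integer() converts to NA.
    out[poly] <- as.integer(substr(digits, pos, pos))
  }
  out
}

x <- c("0", "1", "(12)", "{01}", "?", "-", "(AB)")
tokens_to_integer_sketch(x)           # 0 1 NA NA NA NA NA
tokens_to_integer_sketch(x, "first")  # 0 1  1  0 NA NA NA
tokens_to_integer_sketch(x, "last")   # 0 1  2  1 NA NA NA
```

Pre-filling the result with `NA_integer_` keeps the output the same length as the input, whatever the tokens contain.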
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve conflicts from PR #268 (TNT parser fixes) landing on main while this branch was open:
- DESCRIPTION: bump to 2.3.0.9002 (past main's 9001); drop duplicate Config/roxygen2/version line introduced by the merge.
- NEWS.md: consolidate dev entries from both sides under the new header.
- inst/extdata/tests/tnt-*.tnt, tests/testthat/test-ReadTntTree.R: keep main's versions; the branch's earlier deletions were premature.
- man/*.Rd: regenerated via devtools::document() so @family references include both NexusTokensToInteger() (this branch) and the TNT additions from main.
R/parse_files.R auto-merged cleanly. Full devtools::test() green.
Performance benchmark results
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main    #269    +/-   ##
=======================================
  Coverage   96.07%  96.07%
=======================================
  Files          80      80
  Lines        5905    5968    +63
=======================================
+ Hits         5673    5734    +61
- Misses        232     234     +2
Summary
- … `(1 2)`, `{0 1}`, including continuation lines). No parser fix was needed: `gsub(" ", "", ...)` at R/parse_files.R:88 already strips internal whitespace before `NexusTokens()` sees the tokens, so `(1 2)` becomes `(12)`. The new test in tests/testthat/test-parsers.R locks that behaviour down.
- … `NexusTokensToInteger()` (R/parse_files.R:887): converts the character matrix from `ReadCharacters()` (or a vector from `NexusTokens()`, or a `phyDat` object) to an integer matrix.
- `polymorphism = c("?", "first", "last")`: the default `"?"` treats polymorphic tokens like NEXUS missing data (→ `NA_integer_`). `"first"` / `"last"` take the corresponding state digit inside the brackets.
- `phyDat` input is routed through `PhyDatToMatrix(ambigNA = TRUE, inappNA = TRUE)` so fully-ambiguous and inapplicable rows become `NA_integer_` and only true partial polymorphism is subject to the `polymorphism` rule.
- … `0..9` are recognised; non-digit symbols (and any token whose interior contains no digits) map to `NA_integer_`. Documented in `@details`.
- … `regmatches(x[ambig], regexpr(...))`, which silently dropped no-match elements and crashed (or recycled wrongly) on tokens like `"(AB)"` or `"()"`. Replaced with a length-preserving pattern that pads via the `m != -1L` mask. Regression test added.
- … `test-ReadTntTree.R` test cases that no longer exercise live code paths after the parser simplification on this branch.
- … `man/*.Rd` files.
- … `2.3.0.9000` (development) with a NEWS entry.

Test plan
- `devtools::document()` regenerates `NAMESPACE` (export added at line 375) and `man/NexusTokensToInteger.Rd`.
- `devtools::test(filter = "parsers")`: all parser tests pass, including the new Cingulata block, the round-trip from `ReadCharacters()` into `NexusTokensToInteger()`, the no-digit polymorphism regression, the named-vector input case, attribute preservation for `state.labels`, and the `phyDat` → integer routing.
- `devtools::test()`: no regressions. Skipped tests are pre-existing slow-only.
- … `R CMD check` locally if you'd like full CRAN-grade assurance before merging.

🤖 Generated with Claude Code
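The length-preservation issue mentioned in the summary follows from documented base R behaviour: `regmatches()` returns only the matched elements, so its result can be shorter than its input. The sketch below reproduces that and shows one length-preserving alternative using an `m != -1L` mask. The regular expression and variable names are assumptions chosen for illustration; the PR's actual code is not shown on this page.

```r
# Tokens where only the first contains a digit.
x <- c("(12)", "(AB)", "()")

# regexpr() returns -1L for elements with no match.
m <- regexpr("[0-9]", x)

# regmatches() keeps matched elements only, so the result is shorter than x.
dropped <- regmatches(x, m)
length(dropped)  # 1, not 3

# Length-preserving alternative: pre-fill with NA, then assign under the mask.
first_digit <- rep(NA_character_, length(x))
first_digit[m != -1L] <- regmatches(x, m)

as.integer(first_digit)  # 1 NA NA
```

Assigning the shorter `dropped` vector straight back into a three-element slot is what leads to recycling or an error, which the mask avoids.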